fix(ocap-kernel): enforce one delivery per crank, fix rollback cache staleness#879
fix(ocap-kernel): enforce one delivery per crank, fix rollback cache staleness#879
Conversation
…staleness
- Restructure run queue generator to yield exactly one item per
startCrank/endCrank pair, preventing rollback from undoing
unrelated earlier deliveries in the same crank
- Refresh StoredQueue after rollback so cached head/tail pointers
are re-read from DB, fixing dequeue returning undefined
- Invalidate runQueueLengthCache after rollback
- Bypass VatManager.terminateVat() in KernelQueue callback to avoid
waitForCrank() deadlock when terminating from within a crank
- Handle vanished endpoints in KernelRouter.deliverSend with
try/catch, treating as splat instead of crashing
- Change KernelQueue subscriptions to {resolve, reject} so aborted
sends can reject the caller's JS promise immediately
- Distinguish rejected vs fulfilled in invokeKernelSubscription
- Improve splat error messages to describe cause without leaking
internal identifiers (krefs, endpoint IDs)
- Add integration test for orphaned ephemeral exo rejection
- Standardize KernelQueue test loop-exit pattern using sentinel
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… in peer-wallet tests
Coverage Report
File Coverage |
There was a problem hiding this comment.
I think I see what was going wrong here, and if my analysis is right, this PR does not address it at all. The issue is that the crank transaction is being driven inside the runQueueItems generator function, which is only responsible for pulling the next item off the run queue. The actual delivery happens in run, which iterates over the stream produced by runQueueItems, but the latter commits the transaction before the delivery has even happened. Somehow the refactoring that moved the run queue processing loop out of Kernel.ts and into KernelQueue.ts mangled this. I don't understand by run is in KernelQueue.ts at all.
| // Queue empty — sleep until woken | ||
| const { promise, resolve } = makePromiseKit<void>(); | ||
| if (this.#wakeUpTheRunQueue !== null) { | ||
| Fail`run queue already waiting to be woken; cannot sleep again before the previous wake handler is consumed`; |
There was a problem hiding this comment.
Even though it would be both technically wrong and wildly inappropriate, I somehow feel the urge to change the error message to "Can't sleep. Clowns will eat me."
Simplifies the implementation of the kernel's run loop in a purely behavioral refactor. The previous async generator + loop iteration has been unwrapped into a single loop with multiple helper functions. I noticed that the startCrank() call is the only part of the run loop that can throw an uncaught exception, and made a note to investigate that later. An unrelated TODO comment is also added to the kernel router.
|
@cursor review |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
| * | ||
| * @yields the next item in the run queue. | ||
| */ | ||
| async *#runQueueItems(): AsyncGenerator<RunQueueItem> { |
There was a problem hiding this comment.
Crank lifecycle wraps empty queue check unnecessarily
Medium Severity
startCrank() and createCrankSavepoint('start') are called before checking whether the queue has an item. When the queue is empty, a phantom crank is started, a DB savepoint is created, and then endCrank() immediately releases it — all without any delivery. This violates the PR's stated invariant that a crank consists of exactly one delivery. The startCrank/createCrankSavepoint calls belong inside the else branch (after confirming a queue item exists), not before the emptiness check. As @FUDCo noted: "It's ending the crank before the delivery even happens."
Additional Locations (1)
There was a problem hiding this comment.
Unfortunately #getNextRunQueueItem() mutates the kernel store if there is an item on the run queue, so we have to start the crank and create the save point before calling it.
| this.#kernelStore.startCrank(); | ||
| let wakeUpPromise: Promise<void> | undefined; | ||
|
|
||
| this.#kernelStore.startCrank(); |
There was a problem hiding this comment.
startCrank() and endCrank() (down in the finally block) are the only parts of the run loop that can throw uncaught errors. This behavior was pre-existing. Is it what we want?
| if (this.#kernelStore.runQueueLength() > 0) { | ||
| const item = this.#kernelStore.dequeueRun(); | ||
| if (item) { | ||
| return item; | ||
| } | ||
| } | ||
| return undefined; |
There was a problem hiding this comment.
ATTN: in the previous implementation, GC actions, reap actions, and run queue items were processed in this loop:
- All GC actions
- All reap actions
- All run queue items
Now, they are processed in this loop:
- All GC actions
- All reap actions
- One run queue item
Everything obviously appears to work, but we should also convince ourselves that it's correct.
Add "kernel promise" entry distinguishing kernel promises from JS promises, and "decider" entry with function call analogy. Update existing entries to specify "kernel promise" where applicable. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
FUDCo
left a comment
There was a problem hiding this comment.
Based on reviewing the recent commits and our conversation earlier today, I think this is OK to go. I do have one pedantic quibble about a glossary entry. Feel free to leave it as is and take care of it later, of if you want to fix it now I promise quick turn around on a rubber stamp.
docs/glossary.md
Outdated
| runtime environment for vat code and handles object persistence, promise management, and | ||
| [syscall](#syscall) coordination. | ||
| runtime environment for vat code and handles object persistence, [kernel | ||
| promise](#kernel-promise) management, and [syscall](#syscall) coordination. |
There was a problem hiding this comment.
Liveslots doesn't actually manage kernel promises at all, the kernel does (that's why they're called kernel promises). The only promises that get exposed outside the vat by liveslots do, in fact, turn into kernel promises, but liveslots doesn't know anything about this.


As it turns out, we have been violating the invariant that a crank consists of the delivery of a single message or notification. Since at least the introduction of
KernelQueue.tsin #484, one iteration of the kernel's run queue—which should be equivalent to a crank—has actually been able to deliver an unbounded number of messages.This means that, if a delivery aborts mid-crank,
rollbackCrank('start')reverts all deliveries in the crank (including earlier successful ones), creating inconsistency with vat in-memory state and leaving promise subscriptions permanently dangling.This PR ensures that we correctly implement cranks via the kernel's run queue loop as described below.
Summary
whiletoifin KernelQueue generator) and fix staleStoredQueuecaches afterrollbackCrankby refreshing the run queue and invalidatingrunQueueLengthCacheterminateVatcallback in Kernel to avoid deadlock by bypassingVatManager.terminateVat()(which callswaitForCrank())Test plan
KernelQueue.test.ts,KernelRouter.test.ts,crank.test.ts,syscall-validation.test.ts,vat-lifecycle.test.ts)orphaned-ephemeral-exo.test.ts)orphaned-ephemeral-exo.test.tsin kernel-node-runtime)🤖 Generated with Claude Code
Note
High Risk
High risk because it changes core
KernelQueue/KernelRoutercrank semantics, rollback behavior, and how message failures propagate (resolve vs reject), which can affect delivery ordering, retries, and many callers/tests.Overview
Kernel crank semantics are tightened and error propagation is made consistent.
KernelQueue.runis rewritten to process exactly one run-queue item per crank, and JS-side subscriptions created byenqueueMessagenow support bothresolveandrejectso rejected kernel promises reject the returned promise.Rollback and termination handling are hardened.
rollbackCranknow refreshes the stored run-queue and invalidates length caches to avoid stale in-memory state after DB rollback, and abort+terminate paths immediately reject the aborted send’s subscription. Kernel vat termination during a crank bypassesterminateVat()to avoid deadlock.Message “splat” cases are clearer and better handled.
KernelRouterimproves errors for revoked/no-owner/no-object/endpoint-gone cases, resolves splat rejections using the current promise decider, and treats vanished endpoints as a splat with promise rejection.Tests/docs updated and expanded. Many tests are updated to expect promise rejections (including remote comms, revocation, lifecycle), new unit+e2e coverage is added for orphaned ephemeral exos across vat restart,
kernel-utilsexports a newisCapDataguard used to rethrow bootstrap errors as realErrors, and the glossary is expanded/clarified (kernel promises/decider/crank definition).Written by Cursor Bugbot for commit 233587c. This will update automatically on new commits. Configure here.